The Normalized Mutual Information (NMI) has been widely used to evaluate the accuracy of 
community detection algorithms. However, in this article we show that the NMI is seriously affected 
by systematic errors due to the finite size of networks, and may give a wrong estimate of the performance 
of algorithms in some cases. We give a simple theory of the finite-size effect of the NMI and test our 
theory numerically. Then we propose a new metric for the accuracy of community detection, namely 
the relative Normalized Mutual Information (rNMI), which accounts for the statistical significance of the 
NMI by comparing it with the expected NMI of random partitions. Our numerical experiments 
show that the rNMI overcomes the finite-size effect of the NMI. 


Detection of community structures, which asks to partition the nodes of a network into groups, is a key problem in network 
science, computer science, sociology and biology. Many algorithms have been proposed for this problem; see Ref. [1] for 
a review. However, on a given network, different algorithms usually give different results. Thus evaluating the performance 
of these algorithms and finding the best ones are of great importance. 

Usually the evaluations are performed on benchmark networks, each of which has a reference partition. These 
benchmarks include networks generated by generative models, such as the Stochastic Block Model [2] and the LFR model [3], 
with a planted partition as the reference partition, and some real-world networks, such as the famous Karate Club network 
[1] and the Political Blog network, with a partition annotated by domain experts as the reference partition. The 
accuracy of a community detection algorithm is usually represented by the similarity between the reference partition 
and the partition found by the algorithm: the larger the similarity, the better the algorithm performs on the 
benchmark. 

Without loss of generality, in what follows we call the reference partition A and the detected partition B, and our 
task is to study measures of similarity between partition A and partition B. 

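When the numbers of groups are identical, $q_A = q_B = q$, the similarity can be easily defined by the overlap, which 
is the fraction of nodes with identical group labels in A and B, maximized over all possible permutations of the labels: 

$$O(A,B) = \max_{\pi} \frac{1}{n} \sum_{i=1}^{n} \delta_{A_i,\pi(B_i)}, \tag{1}$$

where $n$ is the number of nodes, $A_i$ and $B_i$ are the group labels of node $i$ in the two partitions, $\delta$ is the 
Kronecker delta function and $\pi$ ranges over all permutations of the $q$ groups. 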

However, the overlap is non-zero even if partition B is a random partition: there are roughly $n/q$ 
identical labels in two partitions if labels are distributed randomly and uniformly. One way to refine it is to normalize 
the overlap to scale from 0 to 1: 

$$\tilde{O}(A,B) = \frac{O(A,B) - 1/q}{1 - 1/q}. \tag{2}$$

However, despite its simplicity, using the overlap as the similarity has two problems: first, when the number of groups $q$ is 
large, maximizing the overlap over $q!$ permutations is difficult; second, when the numbers of groups, $q_A$ and $q_B$, are not 
identical, the overlap is ill-defined. 
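
As an illustration of the overlap, it can be computed by brute force when $q$ is small. The following is a minimal Python sketch, not the implementation used in this article; the function name `overlap` and the convention that labels are integers in `range(q)` are assumptions made for the example.

```python
from itertools import permutations


def overlap(a, b, q):
    """Fraction of nodes whose labels agree, maximized over relabelings of b.

    a, b: sequences of integer group labels in range(q), one entry per node.
    Brute force over all q! permutations, so only practical for small q.
    """
    n = len(a)
    best = 0
    for perm in permutations(range(q)):
        # perm[y] is the label that group y of partition b is renamed to
        matches = sum(1 for x, y in zip(a, b) if x == perm[y])
        best = max(best, matches)
    return best / n


a = [0, 0, 0, 1, 1, 2]
b = [1, 1, 1, 2, 2, 0]   # same grouping as a, with the labels renamed
c = [0, 1, 0, 1, 0, 1]   # a different grouping
print(overlap(a, b, 3))  # 1.0
print(overlap(a, c, 3))  # 0.5
```

The loop visits all $q!$ permutations, which is the cost that makes the overlap impractical when $q$ is large.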

Another well-accepted measure of similarity is the Normalized Mutual Information (NMI) [5], which is well-defined 
even when $q_A \neq q_B$. Many studies use the NMI to evaluate their algorithms or to compare different algorithms. 
To define the NMI we need to approximate the marginal probabilities of a randomly selected node being in group 
$a$ of A and in group $b$ of B by $P_A(a) = n_a/n$ and $P_B(b) = n_b/n$, where $n_a$ and $n_b$ denote the group sizes in A and B. 

We know that the spirit of Mutual Information is to quantify the dependence between these two distributions, by 
computing the Kullback-Leibler (KL) distance between the joint distribution $P_{AB}(a,b)$ and the product of the two marginal 
distributions $P_A(a)P_B(b)$: 

$$I(P_A,P_B) = \sum_{a=1}^{q_A}\sum_{b=1}^{q_B} P_{AB}(a,b)\,\log\frac{P_{AB}(a,b)}{P_A(a)P_B(b)}. \tag{3}$$


Due to the property of the KL distance, this quantity is non-negative. $I(P_A,P_B) = 0$ implies that $P_A$ and $P_B$ are 
independent, that is, the detected partition has nothing to do with the ground-truth partition. If $I(P_A,P_B)$ is much 
larger than zero, the detected partition and the ground-truth partition are similar. In practice the joint distribution 
can also be approximated by frequencies, 

$$P_{AB}(a,b) = \frac{n_{ab}}{n},$$

where $n_{ab}$ is the number of nodes that are both in group $a$ of partition A and in group $b$ of partition B. Eq. (3) can be written 
as 


$$I(P_A,P_B) = H(P_A) + H(P_B) - H(P_{AB}), \tag{4}$$


where 


$$H(P_A) = -\sum_{a} P_A(a)\log P_A(a)$$

is the Shannon entropy of the distribution $P_A$, and $H(P_{AB})$ is the entropy of the joint distribution $P_{AB}$. Note that using 
the conditional distribution $P_{A|B}(a|b) = P_{AB}(a,b)/P_B(b)$, one can rewrite Eq. (4) as 

$$I(P_A,P_B) = H(P_A) - H(P_{A|B}), \tag{5}$$

which can be interpreted as the amount of information (surprise) gained about the distribution $P_A$ once B is known. If this 
information gain is 0, knowledge of B does not give any information about A, and the two partitions have nothing 
to do with each other. Obviously, the larger $I(P_A,P_B)$, the more similar the two partitions are. However, this is still 
not an ideal metric for evaluating community detection algorithms, since it is not normalized. As proposed in [5], one 
way to normalize it is to choose $H(P_A) + H(P_B)$ as the normalization, and the Normalized Mutual Information is written 
as 

$$\mathrm{NMI}(P_A,P_B) = \frac{2I(P_A,P_B)}{H(P_A)+H(P_B)}. \tag{6}$$


Since $H(P_{AB}) \le H(P_A) + H(P_B)$, $\mathrm{NMI}(P_A,P_B)$ is bounded below by 0. Also note that $H(P_{AB}) = H(P_A) = H(P_B)$ 
when A and B are identical, which means that in this case $\mathrm{NMI}(P_A,P_B) = 1$. 
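
The definitions above translate directly into a few lines of code. The following Python sketch estimates all probabilities by frequencies and evaluates the NMI as $2I/(H(P_A)+H(P_B))$ with $I = H(P_A)+H(P_B)-H(P_{AB})$. The function names and the choice to return 0 when both partitions consist of a single group (where the ratio is 0/0) are assumptions of this example, not part of the article.

```python
from collections import Counter
from math import log


def entropy(counts, n):
    """Shannon entropy (in nats) of the frequencies counts / n."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def nmi(a, b):
    """NMI of two labelings a and b (one label per node), using frequencies."""
    n = len(a)
    h_a = entropy(Counter(a).values(), n)
    h_b = entropy(Counter(b).values(), n)
    h_ab = entropy(Counter(zip(a, b)).values(), n)   # joint entropy
    if h_a + h_b == 0:
        return 0.0   # assumed convention: both partitions have a single group
    return 2 * (h_a + h_b - h_ab) / (h_a + h_b)


a = [0, 0, 1, 1]
print(nmi(a, [1, 1, 0, 0]))  # identical partitions up to relabeling: close to 1
print(nmi(a, [0, 1, 0, 1]))  # independent partitions: close to 0
```

Unlike the overlap, this computation needs no maximization over permutations and works when the two partitions have different numbers of groups.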

After the NMI was introduced as a metric for comparing community detection algorithms, it became very popular in 
evaluating them. However, in some cases we find that this metric gives inconsistent results. 
One example is shown in Fig. 1 (left), where we compare the NMI between the partitions obtained by four algorithms and the 
planted partition in the stochastic block model with parameter $\epsilon = 1$. These four algorithms are Label Propagation, 
Infomap, the Louvain method and Modularity Belief Propagation. The principles of the 
algorithms are different: the Label Propagation algorithm maintains a group label for each node by iteratively adopting the 
label that most of its neighbors have; the Infomap method compresses a description of information flow on the network; 
the Louvain method maximizes modularity by aggregation; and Modularity Belief Propagation detects statistically 
significant community structures using the landscape analysis of spin glass theory in statistical physics. 

The stochastic block model (SBM) is also called the planted partition model. It has a planted, or ground-truth, 
partition with $q$ groups of nodes; each node $i$ has a group label $t_i \in \{1,\dots,q\}$. Edges in the network are generated 
independently according to a $q \times q$ matrix $p$, by connecting each pair of nodes $(i,j)$ with probability $p_{t_i t_j}$. Here we 
consider the commonly studied case where the $q$ groups have equal size and where $p$ has only two distinct entries, 
$p_{rs} = p_{\rm in}/n$ if $r = s$ and $p_{\rm out}/n$ if $r \neq s$. We use $\epsilon = p_{\rm out}/p_{\rm in}$ to denote the ratio between these two entries; the larger $\epsilon$, 
the weaker the community structure. With $\epsilon = 0$ there are no edges between different groups, 
while with $\epsilon = 1$, as in Fig. 1 (left), the network is deep in the undetectable phase [5] and is essentially a random 
graph. In the latter case, though there is a planted partition in each network, the partition is not detectable, in the 
sense that no algorithm is able to find it or even to find a partition that is correlated with it. This undetectability 
of the planted partition in the SBM has been proved for $q = 2$ groups. 
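
For concreteness, the generative process described above can be sketched in Python as follows. This is an illustrative sketch only; the function name `generate_sbm`, its arguments and the zero-based labels are assumptions of the example, and the quadratic loop over pairs is meant for small networks.

```python
import random


def generate_sbm(n, q, p_in, p_out, seed=None):
    """Sample a network from the SBM with q planted groups of equal size.

    Two nodes in the same group are connected with probability p_in / n,
    two nodes in different groups with probability p_out / n.
    Returns (labels, edges), where edges is a list of pairs (i, j) with i < j.
    """
    rng = random.Random(seed)
    labels = [i * q // n for i in range(n)]   # planted labels 0, ..., q-1
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = (p_in if labels[i] == labels[j] else p_out) / n
            if rng.random() < p:
                edges.append((i, j))
    return labels, edges


# epsilon = p_out / p_in = 1: the planted labels play no role in the edges
labels, edges = generate_sbm(n=300, q=3, p_in=6.0, p_out=6.0, seed=0)
print(len(edges))
```

With `p_out = 0` the same function produces groups with no edges between them, which is the $\epsilon = 0$ limit.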

However, from Fig. 1 we can see that only Modularity BP gives zero NMI on all networks, which is consistent with 
the undetectability described above, while the other three algorithms give positive NMI, and Infomap gives a quite 
large NMI on all networks. Then the question arises: does it mean that Infomap could do the theoretically impossible job 
of finding the planted configuration in the undetectable phase of the SBM? We think this is not the case; rather, the problem 
comes from the NMI, the measure of how much one finds about the planted configuration. 

FIG. 1: (Color online) Normalized Mutual Information (left) and number of groups (right) given by four algorithms on networks 
generated by the Stochastic Block Model with $\epsilon = p_{\rm out}/p_{\rm in} = 1$ (see text for the description of the parameters of the SBM), which is deep in 
the undetectable phase of the SBM, with different system sizes. These algorithms are (from top to bottom in the left panel) Infomap, 
Label Propagation, the Louvain method and Modularity Belief Propagation. These networks are essentially 
random graphs; though in each network there is a planted partition, we expect that no algorithm can obtain any information about 
it. Each point in the figure is averaged over 10 realizations. 

In Fig. 1 (right) we plot the number of groups found by the different algorithms. We can see that only Modularity 
BP gives one group, while the other algorithms report an increasing number of groups as the system size increases. From 
this result we can guess that the large NMI found in Fig. 1 (left) may come from the large number of groups. 

Recall that in computing the NMI of two partitions, we use $n_a/n$ to approximate $P_A(a)$, which is fine when $n \to \infty$ but leads 
to a finite-size effect when $n$ is finite. Since the NMI can be seen as a function of entropies (as in Eq. (4)), we can express 
the finite-size effect of the NMI through the finite-size effect of the entropy, which means that the entropy at infinite system size, 
$H_\infty(\{P_a\})$, is different from the entropy at finite system size, $\langle H_n(\{n_a\})\rangle$, where the expectation is taken over random 
instances. This effect comes from the fluctuations of $n_a$ around its mean value $P_a n$ and the concavity of the 
entropy, as Jensen's inequality implies that 

$$\langle H_n(\{n_a\})\rangle \le H_\infty(\{P_a\}) = H_\infty\left(\left\{\frac{\langle n_a\rangle}{n}\right\}\right).$$


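More precisely, expanding to second order around the mean, we have 

$$\langle H_n(\{n_a\})\rangle \approx -\sum_{a}\left[\frac{\langle n_a\rangle}{n}\log\frac{\langle n_a\rangle}{n} + \frac{1}{2}\,\frac{d^2(x\log x)}{dx^2}\bigg|_{x=\langle n_a\rangle/n}\,\frac{\langle n_a^2\rangle - \langle n_a\rangle^2}{n^2}\right].$$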



Assuming further a distribution for the random variable $n_a$, for example the Bernoulli distribution as 
in [16], the mean entropy of finite-size systems can be estimated by inserting the first and second moments into the last 
equation: 

$$\langle H_n(\{n_a\})\rangle = -\sum_{a=1}^{q}\left[P_a\log P_a + \frac{1-P_a}{2n}\right] = H_\infty(\{P_a\}) - \frac{q-1}{2n}.$$
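
The step from the second-order expansion to the $(q-1)/(2n)$ correction can be spelled out as follows. This is a sketch under two assumptions: logarithms are natural, and $n_a$ has mean $nP_a$ and variance $nP_a(1-P_a)$, which is how the Bernoulli assumption is read here.

```latex
\frac{d^2}{dx^2}\,(x\log x) = \frac{1}{x},
\qquad
\frac{\langle n_a^2\rangle - \langle n_a\rangle^2}{n^2} = \frac{P_a(1-P_a)}{n}
\quad\Longrightarrow\quad
\frac{1}{2}\cdot\frac{1}{P_a}\cdot\frac{P_a(1-P_a)}{n} = \frac{1-P_a}{2n},
\qquad
\sum_{a=1}^{q}\frac{1-P_a}{2n} = \frac{q-1}{2n}.
```

The last sum uses only the normalization $\sum_a P_a = 1$, so the correction depends on the number of groups but not on their sizes.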


Using Eq. (4), the finite-size correction for the mutual information is 

$$\langle I_n(P_A,P_B)\rangle - I_\infty(P_A,P_B) = -\frac{q_A + q_B - q_A q_B - 1}{2n}. \tag{7}$$
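
Eq. (7) can be checked numerically for two independent random partitions, for which $I_\infty = 0$ and the predicted mean mutual information is $(q_A q_B - q_A - q_B + 1)/(2n)$ in nats. The following Python sketch does this with arbitrary illustrative parameters; the function names, the seed and the number of trials are assumptions of the example.

```python
import random
from collections import Counter
from math import log


def entropy(counts, n):
    """Shannon entropy (in nats) of the frequencies counts / n."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def mutual_information(a, b):
    """I = H(A) + H(B) - H(A, B), with probabilities estimated by frequencies."""
    n = len(a)
    return (entropy(Counter(a).values(), n)
            + entropy(Counter(b).values(), n)
            - entropy(Counter(zip(a, b)).values(), n))


def mean_mi_random(n, q_a, q_b, trials, seed=0):
    """Average mutual information of independent, uniformly random labelings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.randrange(q_a) for _ in range(n)]
        b = [rng.randrange(q_b) for _ in range(n)]
        total += mutual_information(a, b)
    return total / trials


n, q_a, q_b = 2000, 5, 8
measured = mean_mi_random(n, q_a, q_b, trials=50)
predicted = (q_a * q_b - q_a - q_b + 1) / (2 * n)   # Eq. (7) with I_inf = 0
print(measured, predicted)
```

The measured average should stay close to the prediction as long as $n$ is large compared with $q_A q_B$; for very sparse contingency tables the first-order expansion becomes less reliable.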
FIG. 2: (Color online) Normalized Mutual Information between two random partitions A and B. Partition A always has 
$q_A = 10$ groups, each of which has expected size $n/q_A$, while B has a varying number of groups $q_B$, with expected group size 
$n/q_B$. In other words, the group label of each node is chosen independently and uniformly, with probability $1/q_A$ per group for partition A and $1/q_B$ for partition 
B. The lines are the theoretical estimate, Eq. (9), and the points with error bars are experimental data, with each point averaged over 10 
random instances. From top to bottom, the system sizes are $n$ = 1000, 2000, 4000 and 8000. 


In our system, if we treat the number of nodes in a network as the sample size, the difference between the NMI estimated using 
Eq. (6) in our network and in the network with an infinite number of nodes is 

$$\langle \mathrm{NMI}_n(P_A,P_B)\rangle - \mathrm{NMI}_\infty(P_A,P_B) = \frac{1}{2n}\,\frac{q_A q_B - q_A - q_B + 1}{H(P_A)+H(P_B)}. \tag{8}$$


One thing we can infer from the last equation is that in finite-size networks, even two random partitions A and B have a 
non-vanishing NMI, whose value is 

$$\mathrm{NMI}^{\rm random}(q_A,q_B) \approx \frac{1}{2n}\,\frac{q_A q_B - q_A - q_B + 1}{H(P_A)+H(P_B)}. \tag{9}$$

Note that in the last equation we put $\approx$ instead of $=$ because $\mathrm{NMI}^{\rm random}(q_A,q_B)$ represents the NMI of a single instance instead 
of the ensemble average. 

To test our theory of the finite-size correction, in Fig. 2 we compare the NMI of two random partitions A and B, with 
$q_A = 10$ groups in partition A and a varying number of groups in partition B. We can see that Eq. (9) gives a good 
estimate of the NMI between two random partitions. From the figure we see that for the same $n$, the finite-size correction 
is smaller with fewer groups. This is consistent with Eq. (9), whose right-hand side is an increasing function of $q_A$ and 
$q_B$. Moreover, for the same $q$, the finite-size correction is smaller with a larger system 
size, due to the $1/n$ dependence in Eq. (9). These two properties also explain the phenomenon we 
saw in Fig. 1, where the NMI of Infomap and Label Propagation changes slowly with system size: the number of groups 
given by these two algorithms increases with system size. We note that a similar finite-size effect for the mutual 
information has been studied before, though in a different context. 

From the above analysis, we see that as a metric for the accuracy of community detection, the NMI prefers a large number 
of groups, which gives a systematic bias to the evaluation results. One way to fix this bias is to consider the statistical 
significance of the NMI, by comparing it to the NMI of a null model. In this article we choose as the null model a random configuration 
C which has the same group-size distribution as the detected partition, and define the relative 
Normalized Mutual Information (rNMI) as 

$$\mathrm{rNMI}(A,B) = \mathrm{NMI}(A,B) - \langle\mathrm{NMI}(A,C)\rangle, \tag{10}$$


where $\langle\mathrm{NMI}(A,C)\rangle$ is the expected NMI between the ground-truth configuration A and a random partition C, averaged 
over realizations of C. 

So if partition B has a large number of groups, then although NMI(A, B) could be large, a random configuration with the 
same group-size distribution as B also has a large number of groups and hence a large NMI(A, C), which finally results in a small 
rNMI(A, B). 

Actually, the idea of computing statistical significance by comparing the score of a partition to the expected score of a null 
model is widely used in science. For example, the well-known metric for community structures, Modularity 
[18], compares the number of internal edges of a partition to the expected number of internal edges in random graphs, 
which act as a null model. 

An easy way to compute $\langle\mathrm{NMI}(A,C)\rangle$ is to generate several random configurations, compute the NMI for each of 
them, and take the average. In practice we find that 10 realizations of C are usually enough. If 
computational speed really matters, we can use the expression 

$$\langle\mathrm{NMI}(A,C)\rangle \approx \frac{1}{2n}\,\frac{q_A q_B - q_A - q_B + 1}{H(P_A)+H(P_B)}.$$

However, as explained above, this expression is only a first-order approximation adopting the Bernoulli distribution, 
hence it is less accurate than the simulated value, as shown in Fig. 2. 
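
The sampling procedure described above can be sketched in Python as follows. This is not the reference implementation cited at the end of the article. As one possible reading of "the same group-size distribution", the random configurations C are generated here by shuffling the labels of B, which keeps its group sizes exactly; this choice, the function names and the default of 10 samples as an argument are assumptions of the example.

```python
import random
from collections import Counter
from math import log


def entropy(counts, n):
    """Shannon entropy (in nats) of the frequencies counts / n."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def nmi(a, b):
    """NMI = 2 I / (H_A + H_B), with probabilities estimated by frequencies."""
    n = len(a)
    h_a = entropy(Counter(a).values(), n)
    h_b = entropy(Counter(b).values(), n)
    h_ab = entropy(Counter(zip(a, b)).values(), n)
    if h_a + h_b == 0:
        return 0.0   # assumed convention when both partitions have one group
    return 2 * (h_a + h_b - h_ab) / (h_a + h_b)


def rnmi(a, b, samples=10, seed=None):
    """rNMI(A, B) = NMI(A, B) - <NMI(A, C)>, averaged over shuffled copies of B."""
    rng = random.Random(seed)
    c = list(b)
    baseline = 0.0
    for _ in range(samples):
        rng.shuffle(c)   # keeps the group sizes of B, destroys any relation to A
        baseline += nmi(a, c)
    return nmi(a, b) - baseline / samples


# Planted partition with 4 groups versus a random partition with 100 groups
n = 1000
a = [i * 4 // n for i in range(n)]
rng = random.Random(1)
b = [rng.randrange(100) for _ in range(n)]
print(nmi(a, b))           # clearly positive, although b is random
print(rnmi(a, b, seed=2))  # fluctuates around zero
```

In this example the NMI is positive only because B has many groups, while the rNMI removes that contribution.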

In Fig. 3 (left) we plot the rNMI given by the same four algorithms used in Fig. 1, where we can see that now all 
algorithms report zero rNMI, telling us correctly that none of them has found useful information about the planted partition 
of SBM networks in the undetectable phase. 

To test the accuracy of the proposed metric, in Fig. 3 (right) we compare the rNMI and the overlap (Eq. (2)) between the 
planted partition and the one detected by the Belief Propagation (BP) algorithm, for benchmark networks generated by the 
stochastic block model. In this benchmark the detectability transition happens at $\epsilon^* = 0.2$. It is known [5] that with 
$\epsilon < \epsilon^*$ the planted configuration is detectable, and the BP algorithm is supposed to find a partition that is correlated with 
the planted configuration; with $\epsilon > \epsilon^*$ the planted configuration is undetectable, which means that the partition given 
by BP should not be correlated with the planted partition. In this case $q_A = q_B$, so the overlap defined in Eq. (2) is a good 
metric and we can test whether the rNMI gives the same information as the overlap. In the figure we see that the values 
of the rNMI and the overlap are consistent: they are both high in the detectable phase with $\epsilon < 0.2$ and low in the undetectable 
phase with $\epsilon > 0.2$. Note that the overlap is not perfectly zero in the undetectable phase, because maximizing the overlap 
over permutations introduces an effect of noise. So in this sense our metric, which reports perfectly zero values 
in the undetectable phase, is a better metric for the similarity of two partitions in the undetectable phase than the overlap, even 
when $q_A = q_B$. 


In Fig. 4 we compare the NMI and the rNMI for three algorithms, Louvain, Infomap and Modularity BP, on benchmark 
networks generated by the LFR model [3] with different system sizes. The LFR model is also a planted model for 
generating benchmark networks with community structures. However, compared with the SBM, networks 
generated by the LFR model have power-law degree and group-size distributions. From the left panel of 
Fig. 4 we can see that if we use the NMI as the metric for the accuracy of these algorithms, we may conclude that Infomap 
works better than Louvain on the whole set of benchmarks. However, from Fig. 4 (right) we can see that Louvain actually 
works better than Infomap, because it gives a larger rNMI. Moreover, the difference in rNMI between 
Louvain and Infomap grows as the system size increases. So Fig. 4 also tells us that using the NMI may give a wrong 
estimate of the performance of community detection algorithms. 

In conclusion, in this article we showed analytically and numerically that using the normalized mutual information 
as a metric for the accuracy of community detection algorithms has a systematic error when the numbers of groups given by the 
algorithms are very different. We proposed to fix this problem by using the relative normalized mutual information, 
which accounts for the statistical significance of the NMI by comparing the NMI of two partitions to the expected NMI 
of random partitions. We note that there are other ways to estimate the finite-size effect of the entropy, e.g. a Bayesian 
estimate. We leave it to future work to refine $\langle\mathrm{NMI}(A,C)\rangle$ in the expression of the rNMI, Eq. (10), using Bayesian 
approaches. 

An implementation of the rNMI and examples of its use can be found at [20]. 
FIG. 3: (Color online) (Left) Relative Normalized Mutual Information given by the same four algorithms on the same set of 
networks used in Fig. 1. Each point is averaged over 10 realizations. (Right) Relative Normalized Mutual Information compared 
with the overlap (Eq. (2)) in networks generated by the stochastic block model with 10000 nodes, 6 groups, average degree 6 and different 
$\epsilon = p_{\rm out}/p_{\rm in}$. 



FIG. 4: (Color online) Normalized Mutual Information (left) and Relative Normalized Mutual Information (right) for three 
algorithms, Infomap, Louvain and Modularity BP, on LFR benchmarks [3] with different system sizes. The 
parameters of the networks are: average degree $c = 8$, mixing parameter $\mu = 0.45$, maximum degree 50, community sizes ranging 
from 200 to 400, exponent of the degree distribution $-2$ and exponent of the community size distribution $-1$. 